Author: Manuele Nolli, student BSc Computer Science SUPSI
Date: 28.11.2022
Mail: manuele.nolli@student.supsi.ch
This document is an analysis of a public dataset found on Kaggle.com.
The dataset contains about 80k wine reviews with variety, location, winery, price, points, taster name and description.
My analysis will focus on the following questions:
With the following code we can see the details of the dataset: how it is structured and the type of each column.
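A minimal sketch of such an inspection, assuming pandas and a tiny hypothetical DataFrame in place of the real file (the names `df` and `summary` are illustrative, not taken from the original notebook):

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real file has 15 columns.
df = pd.DataFrame({
    "country": ["Italy", float("nan"), "US"],
    "points": [88, 91, 85],
    "price": [20.0, float("nan"), 13.0],
})

# For every column, collect the Python types that actually occur in it
# and count the missing values.
rows = [
    {
        "column": name,
        "types": sorted({type(value).__name__ for value in df[name]}),
        "nan_count": int(df[name].isna().sum()),
    }
    for name in df.columns
]
summary = pd.DataFrame(rows).set_index("column")

print(f"Total columns: {df.shape[1]}")
print(f"Dataframe rows: {df.shape[0]}")
print(summary)
```

A column that mixes `str` and `float` usually means text values with NaN placeholders, which is why the type list and the NaN count are shown side by side.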
---Dataset Info---
Total columns: 15
Columns names: country, description, designation, points, price, province, region_1, region_2, taster_name, taster_photo, taster_twitter_handle, title, variety, vintage, winery.
Columns type:

| Column | Types | NaN Count |
|---|---|---|
| country | [<class 'str'>, <class 'float'>] | 5 |
| description | [<class 'str'>] | 0 |
| designation | [<class 'str'>, <class 'float'>] | 21319 |
| points | [<class 'int'>] | 0 |
| price | [<class 'float'>] | 4647 |
| province | [<class 'str'>, <class 'float'>] | 5 |
| region_1 | [<class 'float'>, <class 'str'>] | 12913 |
| region_2 | [<class 'float'>, <class 'str'>] | 49894 |
| taster_name | [<class 'str'>, <class 'float'>] | 150 |
| taster_photo | [<class 'str'>, <class 'float'>] | 150 |
| taster_twitter_handle | [<class 'str'>, <class 'float'>] | 1076 |
| title | [<class 'str'>] | 0 |
| variety | [<class 'str'>] | 0 |
| vintage | [<class 'str'>] | 0 |
| winery | [<class 'str'>] | 0 |
Dataframe rows: 81115
Dataset samples:

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_photo | taster_twitter_handle | title | variety | vintage | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44388 | England | Red currant notes are framed by zesty lemon an... | Traditional Method Rosé | 93 | 89.0 | England | NaN | NaN | Anne Krebiehl MW | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @AnneInVino | Gusbourne Estate 2015 Traditional Method Rosé ... | Sparkling Blend | 2015 | Gusbourne Estate |
| 45489 | New Zealand | Peach, pineapple, guava, tomato leaf and dried... | NaN | 88 | 13.0 | Marlborough | NaN | NaN | Christina Pickard | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @ckpickard | Waxing Moon 2018 Sauvignon Blanc (Marlborough) | Sauvignon Blanc | 2018 | Waxing Moon |
| 79022 | US | This is a high-strung wine tart in acidity and... | Rosé of | 85 | 20.0 | California | Sonoma County | Sonoma | Virginie Boone | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @vboone | Martin Ray 2018 Rosé of Pinot Noir (Sonoma Cou... | Pinot Noir | 2018 | Martin Ray |
| 11447 | US | Dried herb aromas are at the fore of this Cabe... | NaN | 91 | 26.0 | Washington | Columbia Valley (WA) | Columbia Valley | Sean P. Sullivan | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @wawinereport | Novelty Hill 2014 Cabernet Sauvignon (Columbia... | Cabernet Sauvignon | 2014 | Novelty Hill |
| 34141 | France | Light gold in color, this opens with an assert... | Terres | 89 | NaN | Languedoc-Roussillon | Pays d'Oc | NaN | Lauren Buzzeo | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @laurbuzz | Domaine de la Baume 2016 Terres Viognier (Pays... | Viognier | 2016 | Domaine de la Baume |
The dataset contains 81,115 rows (about 80k) and 15 columns. The columns are:
This section shows the distribution of the wines across the continents. Since the dataset only has a country column, I created a new column called continent that contains the continent of each country.
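One way to derive such a column is a plain lookup table applied with `Series.map`. The sketch below assumes a hand-written dictionary that covers only the countries in a toy sample; `COUNTRY_TO_CONTINENT` and the "Unknown" fallback are illustrative choices, not the original implementation.

```python
import pandas as pd

# Assumption: a hand-written lookup covering only the countries in this toy sample.
COUNTRY_TO_CONTINENT = {
    "England": "Europe",
    "France": "Europe",
    "Portugal": "Europe",
    "US": "North America",
    "New Zealand": "Oceania",
}

reviews = pd.DataFrame({"country": ["England", "US", "New Zealand", None, "France"]})

# Countries missing from the lookup (or missing altogether) become "Unknown".
reviews["continent"] = reviews["country"].map(COUNTRY_TO_CONTINENT).fillna("Unknown")

print(reviews)
```

Keeping an explicit fallback value makes the rows with a missing country visible in later charts instead of silently dropping them.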
The following code shows the distribution of the wines across the continents through a pie chart. The majority of the wines are produced in Europe, followed by North America.
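The numbers behind a pie chart of this kind are simply the share of rows per continent. A small sketch with made-up values (the sample series is hypothetical and much smaller than the real data):

```python
import pandas as pd

# Hypothetical continent values for five reviews.
continents = pd.Series(
    ["Europe", "Europe", "North America", "Europe", "Oceania"],
    name="continent",
)

# Absolute number of reviews per continent, largest first.
counts = continents.value_counts()

# Percentage share of each continent, i.e. the size of each pie slice.
shares = (counts / counts.sum() * 100).round(1)

print(shares)
```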
The above chart is an alternative way to see the distribution of the wines across the continents. It is more interactive and shows the exact number of wines produced in each continent, country and region.
Another interesting aspect of the dataset is the distribution of the points. The points are given by the tasters on a scale from 80 to 100, and WineEnthusiast also groups the wines into six categories:
In the following section a new column called pointsDescription is created that contains the description of the score.
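A column like this can be built with `pd.cut`. The score bands in the sketch are an assumption (the commonly cited WineEnthusiast ranges) and should be checked against the list used in the analysis; the only boundaries confirmed by this document are that 89 points is "Very good" and 90 points is "Excellent".

```python
import pandas as pd

# Assumed score bands, upper bounds inclusive:
# 80-82, 83-86, 87-89, 90-93, 94-97, 98-100.
BIN_EDGES = [79, 82, 86, 89, 93, 97, 100]
LABELS = ["Acceptable", "Good", "Very good", "Excellent", "Superb", "Classic"]

scores = pd.DataFrame({"points": [80, 85, 89, 90, 95, 100]})

# pd.cut assigns each score to the interval (lower, upper] it falls in.
scores["pointsDescription"] = pd.cut(
    scores["points"], bins=BIN_EDGES, labels=LABELS
).astype(str)

print(scores)
```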
This graph shows that the majority of the wines are in the Good category, followed by the Very good category (the middle scores are the most common).
It is curious to see that there are more wines with 90 points than with 89 points. That is probably because the tasters are more likely to give a wine 90 points than 89 points to have the wine labeled as Excellent.
In this section it is possible to see the distribution of the vintage of the wines. The vintage is the year in which the grapes were harvested.
It must be remembered that the dataset contains wines reviewed between 2017 and 2020, so it is normal that the majority of the wines are from recent years. However, there are also some very old wines in the dataset. The oldest wine is from 1931 and, surprisingly, it does not have a very high score.
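Because the vintage column is stored as text, finding the oldest wine needs a numeric conversion first. A minimal sketch on hypothetical rows (the titles are invented; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical rows; vintage is a string column, as in the dataset.
wines = pd.DataFrame({
    "title": ["Wine A 2018", "Wine B 1931", "Wine C 2016"],
    "vintage": ["2018", "1931", "2016"],
    "points": [88, 89, 91],
})

# Convert to numbers; values that are not valid years become NaN.
vintage_year = pd.to_numeric(wines["vintage"], errors="coerce")

# idxmin ignores NaN and returns the index label of the smallest year.
oldest = wines.loc[[vintage_year.idxmin()]]

print(oldest)
```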
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_photo | taster_twitter_handle | title | variety | vintage | winery | continent | pointsDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2722 | Portugal | This remarkable wine looks old, and with its d... | Tinto | 89 | 550.0 | Colares | NaN | NaN | Roger Voss | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @vossroger | Adega Viuva Gomes 1931 Tinto Red (Colares) | Ramisco | 1931 | Adega Viuva Gomes | Europe | Very good |
This section shows the distribution of the variety of the wines. The variety is the type of grapes used to make the wine (e.g. Pinot Noir). The dataset contains many different varieties, but I decided to show only the top 10. This setting can be changed through the wineCountToShow variable.
First, I created different versions of the dataset that will be used to create the graphs.
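One possible way to prepare such a version of the data is to keep the most reviewed varieties and collapse the rest into a single "Other" group. In the sketch below only `wineCountToShow` comes from the text; the sample rows, `varietyGroup` and the "Other" label are assumptions.

```python
import pandas as pd

wineCountToShow = 2  # the analysis uses 10; 2 keeps the toy example readable

# Hypothetical reviews, one row per review.
tastings = pd.DataFrame({
    "variety": [
        "Pinot Noir", "Pinot Noir", "Pinot Noir",
        "Chardonnay", "Chardonnay",
        "Ramisco", "Viognier",
    ]
})

# Names of the N most reviewed varieties.
top_varieties = tastings["variety"].value_counts().nlargest(wineCountToShow).index

# Keep the name for top varieties, replace everything else with "Other".
tastings["varietyGroup"] = tastings["variety"].where(
    tastings["variety"].isin(top_varieties), "Other"
)

group_counts = tastings["varietyGroup"].value_counts()
print(group_counts)
```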
Now it is finally time to create the graphs. The left graph is a bar chart that shows the distribution of the wines, the center graph is another bar chart that shows the average points of the wines and the right graph is a box plot that shows the distribution of the prices of the wines.
It is interesting to see that the other varieties together have a lot more reviews than the top 10 varieties, which suggests that the dataframe is well balanced across varieties.
There are two main graphs in this section: the first one shows a box plot representing the distribution of the prices by points, and the second one shows a percentage histogram of the prices grouped by a personal price description:
The box plot shows that, as could be expected, the wines with the highest points tend to be the most expensive, so there is a strong connection between the price and the points. The following histogram confirms this.
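The trend visible in the box plot can also be checked numerically by comparing the median price at each score. A small sketch on invented prices (the values are hypothetical; rows without a price are dropped first because the price column contains NaN):

```python
import pandas as pd

# Hypothetical points and prices, including one missing price.
priced = pd.DataFrame({
    "points": [85, 85, 90, 90, 95, 95],
    "price": [12.0, 18.0, 30.0, float("nan"), 80.0, 120.0],
})

# Median price for each score, ignoring rows without a price.
median_price = (
    priced.dropna(subset=["price"])
    .groupby("points")["price"]
    .median()
)

print(median_price)
print("Median price rises with points:", median_price.is_monotonic_increasing)
```

The median is used rather than the mean because a few very expensive bottles would otherwise dominate each group.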
It is curious to see that there are some wines with a very high price and very low points and, on the other hand, some wines with a very low price and very high points. This means that the price is not the only factor that influences the points.
Note: I tried to create a graph object with the previous two graphs connected by the x-axis, but it is currently not possible to do that with plotly. Further information: https://community.plotly.com/t/how-to-set-barmode-for-individual-subplots/47931
Now it is time to see the distribution of the reviewers. I am interested in seeing how many reviewers there are and how many reviews each of them has done. I also want to see if there are some reviewers that are more reliable than others and if there are some reviewers that are more likely to review wines from a specific continent.
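The basic per-reviewer figures (number of reviews and average score) can be obtained with a grouped aggregation. A sketch on hypothetical rows; the names `per_taster`, `review_count` and `mean_points` are illustrative.

```python
import pandas as pd

# Hypothetical reviews; one of them has no taster name, as happens in the dataset.
reviews = pd.DataFrame({
    "taster_name": ["Roger Voss", "Roger Voss", "Virginie Boone", None],
    "points": [89, 91, 85, 88],
})

# groupby skips rows whose taster_name is missing.
per_taster = (
    reviews.groupby("taster_name")["points"]
    .agg(review_count="count", mean_points="mean")
    .sort_values("review_count", ascending=False)
)

print(per_taster)
```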
There are different considerations to make:
In this section I decided to represent the most used words in the description of the wines for each score category. I used the description column to extract the words after a cleaning process.
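A cleaning and counting step of this kind could look like the sketch below. It uses only the standard library; the stop-word list, the tokenisation rule and the sample descriptions are assumptions, and the original cleaning process may differ.

```python
import re
from collections import Counter

# Assumption: a very small stop-word list, just enough for the example.
STOP_WORDS = {"a", "and", "the", "of", "this", "with", "is", "in", "are", "by"}


def clean_words(text):
    """Lowercase the text, keep runs of ASCII letters, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_words_by_category(rows, n=3):
    """Return the n most common cleaned words for each score category."""
    counters = {}
    for category, description in rows:
        counters.setdefault(category, Counter()).update(clean_words(description))
    return {category: counter.most_common(n) for category, counter in counters.items()}


# Hypothetical (category, description) pairs.
sample = [
    ("Good", "Peach and pineapple, with peach notes."),
    ("Good", "Tart in acidity, peach finish."),
    ("Excellent", "Red currant notes are framed by lemon."),
]

top_words = top_words_by_category(sample)
print(top_words)
```

Note that the simple `[a-z]+` pattern splits accented words, so a real cleaning step would probably need a Unicode-aware tokeniser.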
[Figure: most used words per score category, one panel each: Classic, Acceptable, Excellent, Good, Superb, Very good]